
    String Covering: A Survey

    The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in the future.
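To make the notion of a cover concrete, here is a minimal sketch assuming the standard definition: a string u covers x when every position of x lies inside some occurrence of u in x. The function names (`occurrences`, `is_cover`, `shortest_cover`) are illustrative and not taken from the survey, and the prefix-by-prefix search is a naive baseline rather than any algorithm the survey describes.

```python
def occurrences(u: str, x: str) -> list[int]:
    """Return all 0-based start positions of u in x (overlaps allowed)."""
    positions = []
    start = x.find(u)
    while start != -1:
        positions.append(start)
        start = x.find(u, start + 1)
    return positions


def is_cover(u: str, x: str) -> bool:
    """True if every position of x lies inside some occurrence of u."""
    if not u or len(u) > len(x):
        return False
    covered_up_to = 0  # index of the first position not yet covered
    for start in occurrences(u, x):
        if start > covered_up_to:
            return False  # a gap between consecutive occurrences
        covered_up_to = start + len(u)
    return covered_up_to == len(x)


def shortest_cover(x: str) -> str:
    """Naive search over prefixes; a cover must be a prefix of x."""
    for length in range(1, len(x) + 1):
        if is_cover(x[:length], x):
            return x[:length]
    return x  # only reached for the empty string
```

For example, `shortest_cover("ababaaba")` yields `"aba"`, since occurrences at positions 0, 2 and 5 overlap or abut to span the whole string, whereas `"abcd"` is covered only by itself.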

    Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order

    V-order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V-comparison of strings can be done in linear time and constant space. V-order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V-order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V-order processing. We also describe a methodology for efficiently computing the FM-index in V-order, as well as V-order substring pattern matching using backward search.

    Practical KMP/BM Style Pattern-Matching on Indeterminate Strings

    In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $\Sigma$, say $\sigma = |\Sigma| \le 9$. Both algorithms depend on a preprocessing phase that replaces $\Sigma$ by an integer alphabet $\Sigma_I$ of size $\sigma_I = \sigma$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $\Sigma_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $\sigma \le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer-Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $\Theta(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
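The abstract does not say how letters are mapped to integers. One encoding consistent with the quoted word sizes, assumed here purely for illustration, assigns a distinct prime to each letter and the product of those primes to an indeterminate letter, so that two encoded letters match exactly when their greatest common divisor exceeds 1. With nine letters the largest product is 223092870, which fits in 32 bits, and with four letters it is 210, which fits in 8 bits. The sketch below pairs that assumed encoding with a naive quadratic scan; it is neither KMP Indet nor BM Indet, and all names in it are hypothetical.

```python
from math import gcd

# First nine primes: enough for an alphabet of size sigma <= 9 (assumption).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23]


def make_encoder(alphabet: str):
    """Build an encoder mapping each letter of `alphabet` to a distinct prime."""
    if len(alphabet) > len(PRIMES):
        raise ValueError("alphabet too large for this sketch")
    code = {letter: PRIMES[i] for i, letter in enumerate(alphabet)}

    def encode(indeterminate: list[str]) -> list[int]:
        """Encode each position (a set of letters, given as a string) as a product of primes."""
        encoded = []
        for position in indeterminate:
            value = 1
            for letter in set(position):
                value *= code[letter]
            encoded.append(value)
        return encoded

    return encode


def naive_indet_search(q: list[int], y: list[int]) -> list[int]:
    """Return 0-based positions where encoded pattern q matches encoded text y."""
    m, n = len(q), len(y)
    if m == 0:
        return []
    return [
        i
        for i in range(n - m + 1)
        # Two encoded letters match when they share at least one prime factor.
        if all(gcd(q[j], y[i + j]) > 1 for j in range(m))
    ]


encode = make_encoder("ACGT")
y = encode(["A", "CG", "T", "A", "C", "T"])  # text x with one indeterminate position
q = encode(["AC", "C", "T"])                 # pattern p with one indeterminate position
matches = naive_indet_search(q, y)
```

Here `matches` is `[0, 3]`. Indeterminate matching is not transitive (`A` matches `AC` and `AC` matches `C`, yet `A` does not match `C`), which is why border-based shift rules need care on such strings.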